This is our final project for the course Data Processes. The topic we chose is climate change.

Abstract

Climate change and its impact on the Earth have been a hot issue in recent years. Many scientific studies have been carried out both to analyse the phenomenon and to look for realistic solutions. Recently, after it became clear that it is now too late to find a complete solution, scientists have focused more on predicting the speed of the process so as to be prepared for its consequences. Our aim in this project is to predict how the rise in CO2 over recent years will affect the state of glaciers in the coming months. To make this prediction we investigated three different datasets and processed them with the goal of merging them into a single dataset for our analysis. Afterwards, we applied some machine learning techniques to make our prediction.

Introduction

We would like to answer this question because climate change is one of the most pressing issues we face today. The poles are melting, the sea level is rising and temperatures are climbing. All of these problems can change the world and the way we live. We think people must be aware of this and try to slow the process down.

Exploratory Data Analysis

In this part we present the three datasets that we are going to use.

CO2 emissions

This dataset describes trends in atmospheric carbon dioxide and contains data from 1958 to 2018. The raw dataset has the following features:

  • Date: the day of the recording in YYYY-MM-DD format
  • Average [Parts per million]: The monthly mean CO2 mole fraction determined from daily averages. If there are missing days concentrated either early or late in the month, the monthly mean is corrected to the middle of the month using the average seasonal cycle. Missing months are denoted by -99.99.
  • Interpolated [Parts per million]: Values from the average column and interpolated values where data are missing. Interpolated values are computed in two steps. First, we compute for each month the average seasonal cycle in a 7-year window around each monthly value. In this way the seasonal cycle is allowed to change slowly over time. We then determine the trend value for each month by removing the seasonal cycle; this result is shown in the trend column. Trend values are linearly interpolated for missing months. The interpolated monthly mean is then the sum of the average seasonal cycle value and the trend value for the missing month.
  • Trend [Parts per million]: the seasonally corrected value.
  • Number.of.Days: the number of daily averages used for the recording; -1 denotes that no daily averages are available for the month.
##       Date               Average       Interpolated       Trend      
##  Min.   :1958-03-01   Min.   :313.2   Min.   :313.2   Min.   :314.6  
##  1st Qu.:1973-07-24   1st Qu.:328.9   1st Qu.:328.9   1st Qu.:329.9  
##  Median :1988-09-16   Median :351.6   Median :351.6   Median :352.2  
##  Mean   :1988-08-15   Mean   :353.9   Mean   :353.9   Mean   :353.9  
##  3rd Qu.:2003-09-08   3rd Qu.:376.0   3rd Qu.:376.0   3rd Qu.:376.5  
##  Max.   :2018-09-01   Max.   :411.2   Max.   :411.2   Max.   :409.0  
##                                                                      
##  Number.of.Days       Year    
##  Min.   :-1.00   1959   : 12  
##  1st Qu.:-1.00   1960   : 12  
##  Median :25.00   1961   : 12  
##  Mean   :18.52   1962   : 12  
##  3rd Qu.:28.00   1963   : 12  
##  Max.   :31.00   1965   : 12  
##                  (Other):648

Data Preparation

Before using the dataset we have to carry out some preliminary operations:

  1. Removing non-useful features such as Decimal.Date, since it is a repetition of Date

  2. Converting the Date field to the R Date class so that it can be used in plots, since ggplot2 then treats the data as a time series and handles it differently

  3. Removing the outliers: in Average a few observations have the sentinel value -99.99, which radically changes the scale of the plot
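These steps can be sketched in base R; the toy frame below is an illustration that mirrors the real co2 column names (the actual cleaning chunk is hidden in the report):

```r
# Toy frame mimicking the raw co2 dataset (the column names are the real ones).
co2toy <- data.frame(
  Date = c("1958-03-01", "1958-04-01", "1958-05-01"),
  Decimal.Date = c(1958.16, 1958.25, 1958.33),
  Average = c(315.71, -99.99, 317.50)
)
co2toy$Decimal.Date <- NULL                       # 1. drop the redundant feature
co2toy$Date <- as.Date(co2toy$Date, "%Y-%m-%d")   # 2. convert to R's Date class
co2toy <- co2toy[co2toy$Average != -99.99, ]      # 3. drop the -99.99 sentinel rows
```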

outlier_values <- boxplot.stats(co2data$Average)$out
outlier_values
## numeric(0)

So we simply removed them: there are only 7 such observations, which is negligible compared to the size of the dataset. The boxplot.stats check above, run after the removal, confirms that no outliers remain.

We also wanted to know which was, in terms of climate change, the ‘worst year’ for our planet (most polluted, highest temperature, …).

Premise: by ‘worst year’ we mean the year with the highest CO2 level in the data. Since the CO2 level is increasing year by year, we expect the year with the highest CO2 ppm to be the last recorded one. To answer this question we need some more data manipulation with the R package dplyr. We retrieved the per-year data through the following steps:

  1. Create a new data frame called co2data_year using the Date field and keeping only the year
  2. Transform it into a Date field
  3. Group by Year
  4. Set the variable Trend to the median for that year using the summarise function from dplyr

After doing so we get the following data frame:

co2data_year <-  co2data %>% 
  group_by(Year) %>%
  summarise(Trend = median(Trend))

head(co2data_year)
## # A tibble: 6 x 2
##   Year  Trend
##   <fct> <dbl>
## 1 1958   315.
## 2 1959   316.
## 3 1960   317.
## 4 1961   318.
## 5 1962   318.
## 6 1963   319.

Now we can plot the results.
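A base-R sketch of that plot; the small frame below is a stand-in for co2data_year so the snippet runs on its own:

```r
# Stand-in for co2data_year (Year, Trend); the real frame is built above.
co2_year <- data.frame(Year = 1958:1963,
                       Trend = c(315, 316, 317, 318, 318, 319))
plot(co2_year$Year, co2_year$Trend, type = "b",
     xlab = "Year", ylab = "CO2 trend (ppm)",
     main = "Median CO2 trend per year")
```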

To answer this question it is enough to run a simple query, again using the dplyr package. This command returns one value: the date of the highest CO2 level.

worst_year <-co2data %>% 
  filter(Trend == max(Trend))  %>% 
  select(Date)
worst_year
##         Date
## 1 2018-09-01


International Comprehensive Ocean-Atmosphere Data Set (ICOADS-NOAA)

Dataset description: https://rda.ucar.edu/datasets/ds548.0/

The International Comprehensive Ocean-Atmosphere Data Set (ICOADS) is a global ocean marine meteorological and surface ocean dataset. It contains data measurements and visual observation collected by ships, moored and drifting buoys, coastal stations, and other marine and near-surface ocean platforms.
Each marine report contains individual observations of meteorological and oceanographic variables, such as sea surface and air temperatures, wind, pressure, humidity, and cloudiness. The coverage is global, and sampling density varies depending on the date and on the geographic position relative to shipping routes and ocean observing systems. Because of its size, this dataset was downloaded from Google BigQuery, which is the reason we only have data from 2010 onwards. The dataset consists of 76 columns and 100,000 rows, but many of the columns have a lot of missing values (in many cases more than half of the observations), so we had to drop many of them.

Data processing

Taking into account the general topic of interest, we decided to consider only the features regarding temperature, water pressure, weather, clouds and waves. However, after analysing the number of NA values in the columns, we realized that most of them weren’t useful due to the high percentage of missing values (sometimes more than half), so we deleted them. After that, we also removed the remaining NA values by dropping the rows containing them. Eventually, the final dataset was composed of 9 columns and 69,050 rows.

To understand the relationships between the variables, we studied their correlations. By plotting the correlation matrix, we could easily see that, as expected, the variables sea_surface_temp and air_temperature were highly correlated.
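A sketch of that check on toy data; the variable names below are taken from the cleaned ICOADS columns, but the values are synthetic stand-ins:

```r
# Synthetic frame reproducing the two relationships discussed in the text.
set.seed(1)
toy <- data.frame(air_temperature = rnorm(100, 20, 5))
toy$sea_surface_temp <- toy$air_temperature + rnorm(100, 0, 1)  # strongly related
toy$latitude <- -2 * toy$sea_surface_temp + rnorm(100, 0, 5)    # negatively related
cm <- cor(toy)   # correlation matrix over all numeric columns
round(cm, 2)
```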

Another observation is the negative correlation between sea_surface_temp and latitude. To investigate this further, we plotted sea_surface_temp against latitude and could observe a pattern which, as expected, showed that the temperature is low at the poles and increases the closer we get to the equator.

plot(newOcean$latitude, newOcean$sea_surface_temp)

Afterwards, in order to make the manipulation easier, we aggregated month and day into the variable Date. Moreover, we created a new variable, called Hemisphere, by binning the latitude.
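A minimal sketch of those two derived variables; the column names are assumptions based on the ICOADS fields described above:

```r
# Toy rows standing in for the cleaned ocean dataset.
ocean <- data.frame(year = c(2010, 2011), month = c(3, 11),
                    day = c(5, 20), latitude = c(48.2, -33.9))
# Aggregate year/month/day into a single Date variable.
ocean$Date <- as.Date(paste(ocean$year, ocean$month, ocean$day, sep = "-"))
# Bin latitude into a Hemisphere factor.
ocean$Hemisphere <- cut(ocean$latitude, breaks = c(-90, 0, 90),
                        labels = c("south", "north"))
```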

Graphs

  1. Firstly, we chose to represent the values of sea_level_pressure on a world map. Lower pressure values are represented by blue colours, while higher values are represented by yellow and red colours.

  2. To compare the evolution of temperature during the year at the northern and southern poles, we plotted two bar charts.

The National Snow and Ice Data Center

The National Snow and Ice Data Center (NSIDC) supports research into our world’s frozen realms: the snow, ice, glaciers, frozen ground, and climate interactions that make up Earth’s cryosphere. NSIDC manages and distributes scientific data, creates tools for data access, supports data users, performs scientific research, and educates the public about the cryosphere.

Content

The dataset provides the total ice extent for each day over the entire time period (1978-2015). There are seven variables, shown in the summary below.

The main problem with this dataset is that measurements are taken every 2 days, while the CO2 data are monthly.

##       Year          Month             Day            Extent      
##  Min.   :1978   Min.   : 1.000   Min.   : 1.00   Min.   : 2.080  
##  1st Qu.:1992   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 7.601  
##  Median :2001   Median : 7.000   Median :16.00   Median :12.217  
##  Mean   :2001   Mean   : 6.507   Mean   :15.74   Mean   :11.495  
##  3rd Qu.:2010   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:15.114  
##  Max.   :2019   Max.   :12.000   Max.   :31.00   Max.   :20.201  
##                                                                  
##     Missing         
##  Min.   :0.000e+00  
##  1st Qu.:0.000e+00  
##  Median :0.000e+00  
##  Mean   :3.074e-06  
##  3rd Qu.:0.000e+00  
##  Max.   :2.400e-02  
##                     
##                                                                                                                             Source.Data   
##   ['ftp://sidads.colorado.edu/pub/DATASETS/nsidc0051_gsfc_nasateam_seaice/final-gsfc/north/daily/1978/nt_19781026_n07_v1.1_n.bin']:    1  
##   ['ftp://sidads.colorado.edu/pub/DATASETS/nsidc0051_gsfc_nasateam_seaice/final-gsfc/north/daily/1978/nt_19781028_n07_v1.1_n.bin']:    1  
##   ['ftp://sidads.colorado.edu/pub/DATASETS/nsidc0051_gsfc_nasateam_seaice/final-gsfc/north/daily/1978/nt_19781030_n07_v1.1_n.bin']:    1  
##   ['ftp://sidads.colorado.edu/pub/DATASETS/nsidc0051_gsfc_nasateam_seaice/final-gsfc/north/daily/1978/nt_19781101_n07_v1.1_n.bin']:    1  
##   ['ftp://sidads.colorado.edu/pub/DATASETS/nsidc0051_gsfc_nasateam_seaice/final-gsfc/north/daily/1978/nt_19781103_n07_v1.1_n.bin']:    1  
##   ['ftp://sidads.colorado.edu/pub/DATASETS/nsidc0051_gsfc_nasateam_seaice/final-gsfc/north/daily/1978/nt_19781105_n07_v1.1_n.bin']:    1  
##  (Other)                                                                                                                          :26348  
##  hemisphere   
##  north:13177  
##  south:13177  
##               
##               
##               
##               
## 

For this reason, we had to:

  • Take a look at the possible existing cycles of ice extent during each month
  • See whether we can compress all that information into a few features (such as the mean, the variance, or some other statistical feature)
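One way to sketch that compression in base R, assuming columns named like those in seaice.csv (Year, Month, Extent):

```r
# Toy 2-day measurements for two months of one year.
ice <- data.frame(Year = 2000, Month = rep(1:2, each = 4),
                  Extent = c(13.1, 13.0, 12.9, 13.2, 13.8, 14.0, 13.9, 14.1))
# Collapse the daily values into monthly mean and variance features.
monthly <- aggregate(Extent ~ Year + Month, data = ice,
                     FUN = function(x) c(mean = mean(x), var = var(x)))
monthly
```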

## Analysis of Variance Table
## 
## Response: Extent
##                             Df Sum Sq Mean Sq   F value    Pr(>F)    
## Date                         1   1965    1965  102.7799 < 2.2e-16 ***
## Year                         1  31833   31833 1665.3461 < 2.2e-16 ***
## Month                        1    296     296   15.5032 8.258e-05 ***
## Day                          1   1713    1713   89.5968 < 2.2e-16 ***
## Missing                      1      2       2    0.1213    0.7277    
## hemisphere                   1    461     461   24.1423 9.001e-07 ***
## `Global Extent`              1  20617   20617 1078.5827 < 2.2e-16 ***
## `North-South Difference`     1      0       0    0.0000    0.9984    
## Residuals                26345 503590      19                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Thankfully, we saw almost no variation within the days: as shown in IceSurface.R, the ice extent remains almost constant across the days of each month. This is reasonable, since a noticeable decrease in ice extent on specific days would make no sense; melting is a rather long-term effect.


Questions of interest about Ice

The other thing we did was to provide a separate column for each hemisphere, instead of having two rows per observation (one for the northern hemisphere and one for the southern), and to add two extra columns: the global extent and the difference between north and south. This can be done because all measures are absolute values (10^6 km^2) and are therefore additive.
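A base-R sketch of that restructuring on toy data (the column names follow the seaice summary above):

```r
# Long format: two rows per date, one per hemisphere.
ice <- data.frame(Date = as.Date(rep(c("2000-01-01", "2000-01-03"), each = 2)),
                  hemisphere = rep(c("north", "south"), 2),
                  Extent = c(13.2, 5.1, 13.1, 5.3))
# Wide format: one row per date, one Extent column per hemisphere.
wide <- reshape(ice, idvar = "Date", timevar = "hemisphere", direction = "wide")
# The measures are additive (10^6 km^2), so sums and differences are meaningful.
wide$GlobalExtent   <- wide$Extent.north + wide$Extent.south
wide$NorthSouthDiff <- wide$Extent.north - wide$Extent.south
wide
```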

#Question 1: How did the ice surface evolve over time? As we can see, both hemispheres’ ice extents follow a yearly cycle in which they compensate each other. However, we must note that the southern hemisphere shows much larger variations in extent than the northern hemisphere. We can infer that these cycles correspond to the 6-month seasons at each pole, since the values stay low for approximately 6 months and then oscillate to a 6-month period of maximum values.

We can also see here the corresponding values for each hemisphere. The southern one is almost constant over the years, but the northern one is significantly reduced, as we can see from the regression line.

#Question 2: Is there a cycle of ice regarding the days of the month?

As we can see, there is no obvious cycle across the days; there are perhaps small differences in the last days of the month, but nothing substantial.

#Question 3: Is there a cycle of ice regarding months of the year? Maybe related to seasonal weather?

We can see that the seasonal effect is strong. However, for many of the months there are a lot of outliers, which may imply that many exceptions to the normal values occurred, perhaps due to the trend across the years.

#Question 4: Is there a different amount of ice extent between the northern hemisphere and the southern hemisphere?

As we noted above, there is.

#Question 5: What is the annual average of ice? We can see that the average annual amount of ice is sharply decreasing: 2.5 million km^2 lost in less than 50 years!

Methods

To be able to create a model, we first need a single dataset. In this case we are only going to use the National Snow and Ice Data Center and CO2 datasets.

We aggregated the data by month, calculating the mean for each month, and then merged the CO2 dataset with the ice extent dataset. This was an easy task after aligning the time stamps, but we were careful during the merge so that missing values were completely eliminated from both datasets.

After doing that, we explored the data. We could observe that the ice variation at the South Pole was almost insignificant, since the regression line was completely flat and the variation remained the same. The main concern is the northern ice extent, which is strongly affected by CO2. We made a graph representing this, shown below.

Strength of relationships

The relationships between CO2 and ice extent were tested.

These plots show an existing negative correlation between the northern hemisphere ice extent and CO2: if CO2 concentrations increase, northern hemisphere ice decreases. Both the correlation and the partial correlation were tested. The correlation between CO2 and northern ice extent was -0.21. On its own this could be misleading, since the value is close to 0. We therefore also computed a Pearson partial correlation, controlling for all the other variables so that they do not interfere in the relationship between these 2 variables, and obtained -0.447. The relationship is thus stronger once the other variables are controlled for: even though the relationship between the two variables is not linear, there is a negative relationship between them.
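The partial-correlation idea can be sketched with the residual method in base R; the toy variables below are synthetic stand-ins, and the report’s actual values (-0.21 raw, -0.447 partial) come from the merged dataset:

```r
# Synthetic example: a common driver weakens the raw correlation between
# co2 and ice, and controlling for it reveals a stronger relationship.
set.seed(42)
n <- 200
month <- rnorm(n)                       # stand-in for the controlled variables
co2   <- 0.5 * month + rnorm(n)
ice   <- -0.4 * co2 + 0.5 * month + rnorm(n)
raw_cor <- cor(co2, ice)                # weak, near 0
# Partial correlation of co2 and ice, controlling for month:
partial_cor <- cor(resid(lm(co2 ~ month)), resid(lm(ice ~ month)))
```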

fit <- lm(NorthIceAverageExtent ~ Trend.for.Co2.amount ,data = data)

scatterplot <- data %>%
  ggplot(aes(x = Trend.for.Co2.amount, y = NorthIceAverageExtent, color = Date)) + 
  ggtitle("Correlation between North Hemisphere Ice surface and CO2 concentration")+
  theme(plot.title = element_text(hjust = 0.5))+
  geom_point() + 
  labs(x = "Co2 Amounts (mole fraction)", 
       y = "Ice extent (10^6 km)") +
  theme(legend.position = "none")

ggplotly(scatterplot) %>% 
  add_lines(x = ~data$Trend.for.Co2.amount, y = fitted(fit))


### Predictions

We want to predict the sea ice Extent based on the amount of CO2 (Trend) present in the atmosphere. To do that, we need to merge the information from the two datasets: co2 and seaice.

We are going to join them by Date, but before doing so we need to clean the data and select the columns (making sure there are no null values), aggregate the seaice information by year and month (because the co2 dataset is aggregated by month) and create a new Date variable from these. Then we perform an inner join to make sure that we will not end up with missing values.

#Load co2_data, select columns and format Date
co2_data<-read.csv("../data/co2.csv",header=TRUE, na.strings = "NA", as.is = TRUE)
co2_data <- co2_data %>% select(-Decimal.Date) %>% select(-Number.of.Days)
co2_data$Date<-as.Date(co2_data$Date, format = "%Y-%m-%d")
print(paste0('number of missing values in the co2 dataset: ',sum(is.na(co2_data))))
## [1] "number of missing values in the co2 dataset: 0"
#Load ice_data and select columns
ice_data<-read.csv("../data/seaice.csv",header=TRUE)
ice_data<-ice_data %>% select(-Source.Data) %>% select (-Missing) %>% select (-Day)
#Grouping by year and month, sort and format date
ice_data<-ice_data %>% group_by(Year, Month) %>% summarise(Extent=mean(Extent)) %>% arrange(Year, Month)
ice_data$Date<-as.Date(paste0(ice_data$Year, "-", ice_data$Month,'-01'), format = "%Y-%m-%d")
print(paste0('number of missing values in the ice dataset: ',sum(is.na(ice_data))))
## [1] "number of missing values in the ice dataset: 0"
#Join both datasets by Date field
joined_df<-inner_join(ice_data, co2_data, by='Date')

After joining the datasets, we fit a linear model to observe how well the data fit a line: Extent ~ Trend + Year + Month. The multiple R-squared tells us that this model explains 64.5% of the variance in the target variable.

## 
## Call:
## lm(formula = Extent ~ Trend + Year + Month, data = joined_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9162 -0.5648  0.1376  0.5993  1.8917 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.94848   65.74491  -1.338   0.1816    
## Trend        -0.04354    0.02037  -2.138   0.0331 *  
## Year          0.05676    0.03663   1.549   0.1220    
## Month         0.32889    0.01173  28.035   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.857 on 476 degrees of freedom
## Multiple R-squared:  0.6446, Adjusted R-squared:  0.6424 
## F-statistic: 287.8 on 3 and 476 DF,  p-value: < 2.2e-16

But if we convert Month to a categorical variable (using the as.factor function), we get a multiple R-squared of 0.9438, a huge improvement. This tells us that the value of Extent depends heavily on the month of the year.

## 
## Call:
## lm(formula = Extent ~ Trend + Year + as.factor(Month), data = joined_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.17899 -0.22410 -0.01598  0.22506  1.14943 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -91.419281  26.427214  -3.459 0.000591 ***
## Trend               -0.045111   0.008189  -5.509 5.98e-08 ***
## Year                 0.058889   0.014726   3.999 7.40e-05 ***
## as.factor(Month)2   -0.530093   0.077043  -6.880 1.93e-11 ***
## as.factor(Month)3    0.034173   0.077072   0.443 0.657684    
## as.factor(Month)4    1.105635   0.077134  14.334  < 2e-16 ***
## as.factor(Month)5    2.062052   0.077187  26.715  < 2e-16 ***
## as.factor(Month)6    2.890811   0.077305  37.395  < 2e-16 ***
## as.factor(Month)7    3.008874   0.077377  38.886  < 2e-16 ***
## as.factor(Month)8    2.740136   0.077531  35.343  < 2e-16 ***
## as.factor(Month)9    2.726267   0.077675  35.098  < 2e-16 ***
## as.factor(Month)10   3.529631   0.077794  45.371  < 2e-16 ***
## as.factor(Month)11   3.612928   0.078016  46.310  < 2e-16 ***
## as.factor(Month)12   2.009024   0.078187  25.695  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3445 on 466 degrees of freedom
## Multiple R-squared:  0.9438, Adjusted R-squared:  0.9422 
## F-statistic: 601.8 on 13 and 466 DF,  p-value: < 2.2e-16
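The effect can be reproduced on toy data; the sketch below is an illustration with synthetic values shaped like joined_df, not the report’s actual fit (whose summaries appear above):

```r
# Toy frame with a seasonal (sinusoidal) Extent; the factor encoding of
# Month can capture this shape, while the linear encoding cannot.
set.seed(7)
toy <- data.frame(Year = rep(2000:2019, each = 12), Month = rep(1:12, 20))
toy$Trend  <- 370 + 0.2 * (toy$Year - 2000) + rnorm(nrow(toy), 0, 0.5)
toy$Extent <- 12 + sin(2 * pi * toy$Month / 12) -
              0.05 * (toy$Year - 2000) + rnorm(nrow(toy), 0, 0.2)
fit_num <- lm(Extent ~ Trend + Year + Month, data = toy)            # Month linear
fit_fac <- lm(Extent ~ Trend + Year + as.factor(Month), data = toy) # Month dummies
summary(fit_num)$r.squared
summary(fit_fac)$r.squared  # higher: the 11 dummies capture the seasonal shape
```

The factor model always fits at least as well, because its design space contains the linear-Month model’s as a special case.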

After that, it is time to start applying machine learning algorithms using the caret package. We have chosen two: k-NN and random forest. The first is simple and powerful, and the second is easy to understand and good at handling both categorical and numerical data.

modeling_df<-joined_df %>% select(-Date) %>% select(-Average) %>% select(-Interpolated)
trainIndex <- createDataPartition(modeling_df$Extent,
                                  p = .8,       # Proportion of data used for training
                                  list = FALSE, # Return the values as a vector (as opposed to a list)
                                  times = 1     # Only create one set of training indices
)

After randomly splitting the dataset into training (80%) and test (20%) sets, k-NN is faster to train and achieves an RMSE of around 0.6 on the test set.

# Subset your data into training and testing set
training_set <- modeling_df[ trainIndex, ] # Select rows with generated indices
test_set <- modeling_df[ -trainIndex, ]    # Remove rows with generated indices

# Grid for the hyperparameter tuning
rf_grid <- expand.grid(mtry=c(3,6,9,12,15))
knn_grid <- expand.grid(k=c(3,4,5,6,7))
knn_model<- train(Extent ~ Trend + Year + Month, data = training_set, 
                  method = "knn", trControl=trainControl(method = "cv", number=3), tuneGrid=knn_grid)
rf_model<- train(Extent ~ Trend + Year + as.factor(Month), data = training_set, 
                 method = "cforest", trControl=trainControl(method = "cv", number=3), tuneGrid=rf_grid)

Compared to the standard deviation of the target variable (1.43) this is not bad, but still improvable. We then used a random forest, including the variable Month as categorical. It takes longer to train, but we get an RMSE of around 0.38 on the test set, which is much better. Training is performed with 3-fold cross-validation, and parameter tuning is done with a grid search.

# Compute statistics on the target variable
summary(modeling_df$Extent)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.12   10.16   11.87   11.55   12.72   14.05
sd(modeling_df$Extent)
## [1] 1.433188
# Compute the prediction error RMSE
print(paste0('RMSE of prediction with knn model in training dataset: ', RMSE(predict(knn_model, training_set), training_set$Extent)))
## [1] "RMSE of prediction with knn model in training dataset: 0.409845737371037"
print(paste0('RMSE of prediction with knn model in test dataset: ', RMSE(predict(knn_model, test_set), test_set$Extent)))
## [1] "RMSE of prediction with knn model in test dataset: 0.634989450844869"
# Compute the prediction error RMSE
print(paste0('RMSE of prediction with random forest model in training dataset: ', RMSE(predict(rf_model, training_set), training_set$Extent)))
## [1] "RMSE of prediction with random forest model in training dataset: 0.329028097172657"
print(paste0('RMSE of prediction with random forest model in test dataset: ', RMSE(predict(rf_model, test_set), test_set$Extent)))
## [1] "RMSE of prediction with random forest model in test dataset: 0.376541805951893"

After the linear model and the random forest, we can conclude that there is a clear pattern in the data from these two datasets, meaning that we can predict the ice extent from the CO2 level, the year and the month of the year.

Results

Discussion and future work

Another interesting approach would have been time series analysis. There are aspects of the data that could be understood from a temporal perspective: seasonality, trend, etc. We would also have liked to include data from the ICOADS (ocean) dataset but, unfortunately, we only had data from 2010. The ICOADS dataset is also temporal, but it is huge: it lives in Google BigQuery, so we could not download it completely. As future work it would be interesting to use Spark from R on the whole dataset to try to improve the predictions with the ocean data.
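As a pointer for that future work, a seasonal-trend decomposition with base R’s stl() would be a natural first step; the series below is a synthetic stand-in for the monthly Extent:

```r
# Synthetic monthly series with trend + seasonality, mimicking Extent.
set.seed(1)
extent <- ts(12 + sin(2 * pi * (1:120) / 12) - 0.01 * (1:120) +
               rnorm(120, 0, 0.1),
             start = c(2000, 1), frequency = 12)
dec <- stl(extent, s.window = "periodic")  # seasonal / trend / remainder
head(dec$time.series)
```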